Learning to Recognize Pedestrian Attribute

Yubin Deng, Ping Luo, Chen Change Loy, Member, IEEE, and Xiaoou Tang, Fellow, IEEE 


Abstract —Learning to recognize pedestrian attributes at far 
distance is a challenging problem in visual surveillance since 
face and body close-shots are hardly available; instead, only
far-view image frames of pedestrians are given. In this study,
we present an alternative approach that exploits the context of 
neighboring pedestrian images for improved attribute inference 
compared to the conventional SVM-based method. In addition, 
we conduct extensive experiments to evaluate the informativeness 
of background and foreground features for attribute recognition. 
Experiments are based on our newly released pedestrian attribute 
dataset, which is by far the largest and most diverse of its kind. 

Index Terms —Attribute recognition; visual surveillance. 

I. Introduction 

Learning to recognize pedestrian attributes, such as gender, age,
and clothing style, has received growing attention in computer
vision research due to its high application potential in areas such
as video-based business intelligence [17] and visual surveillance
[6]. In real-world video surveillance scenarios, clear close-shots
of face and body regions are seldom available. Thus, attribute
recognition has to be performed at far distance using pedestrian
body appearance (which can be partially occluded) in the absence of
critical face/close-shot body visual information.

Pedestrian attribute recognition at far distance is non-trivial
due to: 1) Appearance diversity - owing to diverse appearances of
pedestrian clothing and uncontrollable multi-factor variations such
as illumination and camera viewing angle, there exist large
intra-class variations among different images for the same
attribute; 2) Appearance ambiguity - far-view attribute recognition
is a remarkably difficult task due to limited image resolution,
inherent visual ambiguity, and poor quality of visual features
obtained from the far view field (Fig. 1).

Related work: Cao et al. [4] are among the first to study human
attribute recognition from full body images. In their study, HOG
features extracted from overlapping patches are used along with an
Adaboost classifier for recognizing the gender attribute. Bourdev
et al. propose the use of poselets [1] for attribute recognition.
In particular, HOG features, color histograms, and skin-specific
features are extracted on local poses for poselet-level attribute
classification. Zhu et al. [2 ] extract dense color, LBP, and HOG
features to train Adaboost and weighted kNN classifiers for
attribute classification. Although these approaches have all tried
to train a robust attribute detection model, they either relied on
a small dataset or considered too few attributes for analysis. In
view of the growing research interest in the field of human
re-identification [6], [20], [12], which aims at detecting the same
person across spatial and temporal distance, the role of pedestrian
attributes has become vital, as mid-level features are shown to be
exceptional for aiding the human re-identification task [1 ]. In
particular, Layne et al. [7], [8] propose an intersection kernel SVM
with a mixture of colour (RGB, HSV and YCbCr) and texture histograms
(8 Gabor filters and 13 Schmid filters) for learning a selection of
pedestrian attributes as a form of mid-level features to describe
people. The use of attributes has shown remarkable re-identification
performance compared to employing low-level features alone, but the
attribute recognition performance in [7], [8] has yet to be
improved.

Fig. 1. Sample images of far-view pedestrians and their corresponding
binary parsing masks. Positive and negative samples are indicated by
blue and red boxes, respectively.

Contributions: As discussed above, most existing pedestrian
attribute studies focus either on feature engineering or classifier
learning. To better mitigate the appearance diversity and ambiguity
issues, we explore some new perspectives of exploiting neighborhood
and background contexts in this study: 1) We view multiple
pedestrian images as forming a Markov Random Field (MRF) graph in
order to exploit the hidden neighborhood information for better
attribute recognition performance. The underlying graph topology is
automatically inferred, with node associations weighted by pairwise
similarity between pedestrian images. The similarity can be
estimated from the conventional Euclidean distance or from the more
elaborate decision forest-based similarity with feature selection
[22], [23]. By carrying out inference on the graph, we jointly
reason about and estimate the attribute probability of all images
in the graph. 2) We extract foreground segments of pedestrians
through deep learning-based parsing and extensively evaluate the
integration of foreground segments with background context for
improved pedestrian representation. All experiments are
systematically conducted on the largest pedestrian attribute
dataset, introduced by us.

II. Methodology

A. From Pedestrian Parsing to Representation 

The goal of pedestrian attribute recognition is to quantify an
attribute with a value, $l_u$, given the $d$-dimensional feature
vector, denoted by $u \in \mathbb{R}^d$, of a pedestrian image.
Conventionally the features are extracted from the whole pedestrian
image defined by the detection bounding box [4], [ ], [8], [21],
denoted as $u^{whole}$.

However, it is more intuitive to use only the features from the
foreground for attribute recognition. Would background regions play
any role? We wish to examine whether discarding the background
region would facilitate more accurate recognition of pedestrian
attributes. To this end, we train a Deep Decompositional Network
(DDN) [13] to parse a pedestrian image into different body regions.
Such a deep network is a unified architecture that combines
occlusion estimation, data completion, and data transformation for
pedestrian segmentation and parsing, with each layer being fully
connected to the next upper layer. Due to page limits, we refer
readers to [13] for the network structure and training details of
the DDN. At test time, the DDN parses the input image into multiple
pedestrian regions. As depicted in Fig. 1, we define regions such
as hair, face, body, arms, and legs of the pedestrian to be the
foreground and we consider the remaining regions to be the
background. Utilizing the binary masks (Fig. 1) produced by the
DDN, we investigate the following combinations of features
extracted from foreground $u^{fore}$, background $u^{back}$, and the
whole image $u^{whole}$, namely $u^{whole}$ alone, $u^{fore}$ alone,
foreground and background feature concatenation
($u^{fore}$, $u^{back}$), and foreground and whole image feature
concatenation ($u^{fore}$, $u^{whole}$).

B. Recognition of Attributes using Neighborhood Context 

To improve attribute recognition, we further propose to exploit the
context of neighboring images using a Markov Random Field (MRF),
which is an undirected graph in which each node represents a random
variable and each edge represents the relation between two
connected nodes. Traditionally, the neighborhood information in an
MRF is defined by the nearby pixels in a single image, such as in
the application of smoothing [9] in image segmentation [16]. In the
context of attribute recognition, we hypothesize that neighboring
images share natural invariance in their feature space, which could
be treated as a form of regularization. As such, attribute
inference of an image can be locally constrained by its neighbors
to obtain a more reliable prediction. Hence in this work, we define
the energy function of the MRF over a graph $G$ as follows

$$E_{MRF}(G) = \sum_{u \in G} C_u(l_u) + \sum_{u \in G} \sum_{v \in N(u)} S_{uv}(l_u, l_v), \qquad (1)$$

where $u, v \in G$ are two random variables in the graph and $l_u$
denotes the state of $u$. $C_u$ and $S_{uv}$ signify the unary cost
and pairwise cost functions, respectively. More precisely, they
indicate the cost of assigning state $l_u$ to variable $u$ and the
cost of assigning states to neighboring nodes $u, v$, which is
determined by the graph structure (e.g., assigning different states
to nodes that are similar is penalized). $N(u)$ is the set of
variables that are the neighbors of $u$.
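
To make Eq. (1) concrete, the following minimal sketch evaluates the energy of a labeling on a toy chain of three images and finds the minimizer by exhaustive search. It assumes a binary attribute and a pairwise cost that is paid only when two neighboring labels differ; the function name `mrf_energy` and all numbers are illustrative and are not taken from the experiments.

```python
from itertools import product


def mrf_energy(labels, unary, pairwise):
    """Energy of one labeling.

    labels   : tuple of 0/1 states, one per image (node)
    unary    : unary[u][l] is the cost C_u(l) of giving node u the state l
    pairwise : {(u, v): cost} for v in N(u); the cost is paid only if the
               two states differ (an assumption of this sketch)
    """
    unary_term = sum(unary[u][labels[u]] for u in range(len(labels)))
    pairwise_term = sum(
        cost for (u, v), cost in pairwise.items() if labels[u] != labels[v]
    )
    return unary_term + pairwise_term


# Toy chain 0 - 1 - 2. Node 1 weakly prefers state 0 on its own,
# while its two neighbors strongly prefer state 1.
unary = [(2.0, 0.1), (0.6, 0.8), (1.5, 0.2)]
pairwise = {(0, 1): 0.5, (1, 0): 0.5, (1, 2): 0.5, (2, 1): 0.5}

# Exhaustive search is only feasible for a handful of nodes.
best = min(product((0, 1), repeat=3), key=lambda l: mrf_energy(l, unary, pairwise))
independent = tuple(0 if c0 < c1 else 1 for c0, c1 in unary)

print(independent)  # (1, 0, 1): each node decided on its own
print(best)         # (1, 1, 1): node 1 is pulled towards its neighbors
```

The toy case shows the intended regularization effect: a weak individual prediction is overruled when similar neighbors agree on the other state.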

Each random variable corresponds to an image and the relation
between two variables corresponds to the similarity between images.
The variable states $l_u$ are the values of the image attribute.
The unary function is modeled by

$$C_u(l_u) = -\log P(l_u \mid u), \qquad (2)$$

where $P(l_u \mid u)$ is the probability of predicting the attribute
value of image $u$ as $l_u$. This probability can be conveniently
obtained by mapping the output scores of ikSVM.
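
The text does not specify how the ikSVM scores are mapped to probabilities. As a hedged illustration only, the sketch below assumes a sigmoid mapping in the spirit of Platt scaling; the parameters `a` and `b` and the function name `unary_costs` are hypothetical, and in practice they would be fitted on held-out data.

```python
import math


def unary_costs(svm_score, a=1.0, b=0.0, eps=1e-6):
    """Return (C_u(0), C_u(1)) = (-log P(0|u), -log P(1|u)) for one image.

    Assumes P(1|u) = sigmoid(a * score + b); a and b are placeholders.
    """
    z = max(min(a * svm_score + b, 50.0), -50.0)   # keep exp() in range
    p_present = 1.0 / (1.0 + math.exp(-z))
    p_present = min(max(p_present, eps), 1.0 - eps)  # avoid log(0)
    return -math.log(1.0 - p_present), -math.log(p_present)


# A positive score makes "attribute present" the cheaper state.
cost_absent, cost_present = unary_costs(1.2)
print(cost_present < cost_absent)  # True
```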

Now we consider the definition of the pairwise function. To define
affinity between nodes, a simple way widely adopted by existing
methods, such as [1 ], is the Gaussian kernel,
$\exp\{-\|u - v\|^2 / \sigma^2\}$, in which $u, v$ indicate the
feature vectors of two images and $\sigma$ is a coefficient that
needs to be tuned. The graph built on this kernel function can model
the global smoothness among images. However, when large variations
are present, one may consider modeling the local smoothness and
discovering the intrinsic manifold of the data. Thus, an
alternative is to employ the random forest (RF) [3] to learn the
pairwise function [22], [23]. The RF we adopted is unsupervised,
with pairwise sample similarity derived from the data partitioning
discovered at the leaf nodes of the RF as output. The unsupervised
RF can be learned using the pseudo two-class method as in [22],
[23] and [10]. The pairwise function in our MRF model can hence be
expressed as

$$S_{uv}(l_u, l_v) = \begin{cases} \frac{1}{T} \sum_{t=1}^{T} \exp\{-dist^t(u, v)\} & \text{if } l_u \neq l_v, \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

Here, $dist^t(u, v) = 0$ if $u, v$ fall into the same leaf node and
$dist^t(u, v) = +\infty$ otherwise, where $t = 1, \dots, T$ is the
index of the tree. Since the graph is dense, inference on the MRF is
difficult. Thus, we build a $k$-NN sparse graph by limiting the
number of neighbors for each node. We set $k = 5$ in our
experiment. Eq. (1) can be efficiently solved by the
min-cut/max-flow algorithm introduced in [2].
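
The pipeline described above (leaf-based similarity, sparse k-NN graph, min-cut inference) can be sketched as follows for a binary attribute. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the pairwise weight is assumed to be the fraction of trees in which two images share a leaf, the leaf indices and probabilities are made-up toy values, and a plain Edmonds-Karp max-flow stands in for the solver of [2].

```python
import math
from collections import deque


def forest_similarity(leaves_u, leaves_v):
    """Mean over trees of exp(-dist^t): 1 for a shared leaf, 0 otherwise."""
    shared = sum(1 for a, b in zip(leaves_u, leaves_v) if a == b)
    return shared / len(leaves_u)


def knn_pairwise(leaves, k):
    """Sparse graph: node u keeps its k most similar nodes as N(u)."""
    pairwise = {}
    for u in range(len(leaves)):
        ranked = sorted(
            ((forest_similarity(leaves[u], leaves[v]), v)
             for v in range(len(leaves)) if v != u),
            key=lambda item: (-item[0], item[1]),
        )
        for weight, v in ranked[:k]:
            if weight > 0.0:
                pairwise[(u, v)] = weight
    return pairwise


def min_cut_labels(unary, pairwise):
    """Minimize unary + label-disagreement costs with an s-t minimum cut.

    unary[u] = (cost of state 0, cost of state 1), both non-negative.
    """
    n = len(unary)
    source, sink = n, n + 1
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]
    for u, (cost0, cost1) in enumerate(unary):
        cap[source][u] += cost1  # cut if u lands on the sink side (state 1)
        cap[u][sink] += cost0    # cut if u stays on the source side (state 0)
    for (u, v), weight in pairwise.items():
        cap[u][v] += weight      # exactly one of the two arcs is cut
        cap[v][u] += weight      # when the states of u and v differ
    while True:
        parent = [-1] * (n + 2)
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:  # BFS in the residual graph
            a = queue.popleft()
            for b in range(n + 2):
                if parent[b] == -1 and cap[a][b] > 1e-12:
                    parent[b] = a
                    queue.append(b)
        if parent[sink] == -1:
            break  # no augmenting path: visited nodes form the source side
        bottleneck, b = float("inf"), sink
        while b != source:
            bottleneck = min(bottleneck, cap[parent[b]][b])
            b = parent[b]
        b = sink
        while b != source:
            cap[parent[b]][b] -= bottleneck
            cap[b][parent[b]] += bottleneck
            b = parent[b]
    return [0 if parent[u] != -1 else 1 for u in range(n)]


# Toy data: leaf index of 5 images in each of 3 trees (hypothetical).
leaves = [[0, 1, 2], [0, 1, 3], [0, 2, 2], [5, 6, 7], [5, 6, 8]]
p_present = [0.9, 0.45, 0.8, 0.2, 0.55]  # hypothetical classifier outputs
unary = [(-math.log(1.0 - p), -math.log(p)) for p in p_present]
pairwise = knn_pairwise(leaves, k=2)
labels = min_cut_labels(unary, pairwise)
print(labels)  # [1, 1, 1, 0, 0]
```

Images 1 and 4 would receive states 0 and 1 from their own probabilities; the graph flips both to agree with the neighbors that share their leaves. The dense capacity matrix is only suitable for small graphs.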

III. Experiments 

A. Settings 

Feature representation: Low-level color and texture features have
been proven robust in describing pedestrian images [8]. They
include 8 color channels from the RGB, HSV, and YCbCr color spaces,
and 21 texture channels obtained by applying Gabor and Schmid
filters to the luminance channel. The parameter settings of the
Gabor and Schmid filters are given in [ ]. We horizontally
partitioned the image region into six strips and then extracted the
above feature channels, each of which is described by a histogram
with 16 bins. To obtain $u^{fore}$ and $u^{back}$, we apply the
binary mask (Fig. 1) to extract features separately from the
foreground and background.
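
As a rough illustration of this representation, the sketch below computes per-strip, per-channel 16-bin histograms with an optional binary mask and then forms the concatenated schemes. It runs on random stand-in channels rather than real color and filter responses; the function name `strip_histograms`, the per-histogram normalization, and the toy image size are assumptions of this sketch.

```python
import random


def strip_histograms(channels, mask=None, n_strips=6, n_bins=16):
    """Concatenate one histogram per (horizontal strip, channel).

    channels : list of H x W nested lists with values in [0, 1)
    mask     : optional H x W nested list of 0/1; only pixels whose mask
               value is 1 are counted
    """
    height = len(channels[0])
    feature = []
    for s in range(n_strips):
        top = s * height // n_strips
        bottom = (s + 1) * height // n_strips
        for channel in channels:
            hist = [0.0] * n_bins
            for y in range(top, bottom):
                for x, value in enumerate(channel[y]):
                    if mask is None or mask[y][x]:
                        hist[min(int(value * n_bins), n_bins - 1)] += 1.0
            total = sum(hist)
            # Normalization is an assumption; empty regions stay all-zero.
            feature.extend(h / total if total else 0.0 for h in hist)
    return feature


random.seed(0)
H, W, C = 12, 4, 29  # 29 = 8 color + 21 texture channels
channels = [[[random.random() for _ in range(W)] for _ in range(H)]
            for _ in range(C)]
fore_mask = [[1 if 1 <= x <= 2 else 0 for x in range(W)] for _ in range(H)]
back_mask = [[1 - m for m in row] for row in fore_mask]

u_whole = strip_histograms(channels)
u_fore = strip_histograms(channels, fore_mask)
u_back = strip_histograms(channels, back_mask)
fore_back = u_fore + u_back    # foreground + background concatenation
fore_whole = u_fore + u_whole  # foreground + whole-image concatenation

print(len(u_whole), len(fore_back), len(fore_whole))  # 2784 5568 5568
```

With six strips, 29 channels, and 16 bins, each single-region descriptor has 6 x 29 x 16 = 2784 dimensions, so the concatenated schemes double the feature length.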

Dataset: We present benchmark results on the PEdesTrian Attribute
(PETA) dataset (Fig. 2)^1,2 introduced by us. This dataset is the
largest and most diverse pedestrian attribute dataset to date.
There are 61 binary attributes covering an exhaustive set of
characteristics of interest, including demographics (e.g. gender
and age range), appearance (e.g. hair style), upper and lower body
clothing style (e.g. casual or formal), and accessories. There are
another four multi-class attributes that encompass 11 basic color
names [18], respectively for footwear, hair, upper-body clothing,
and lower-body clothing. We selected 35 attributes for our study,
consisting of the 15 most important attributes in video
surveillance proposed by human experts [8], [1 ] and 20 difficult
yet interesting attributes chosen by us, covering all body parts of
the pedestrian and different prevalence of the attributes. For
example, the attributes ‘sunglasses’ and ‘v-neck’ have a limited
number of positive examples (Table I). We randomly partitioned the
dataset images into 9,500 for training, 1,900 for verification and
7,600 for testing.

1 Dataset download: http://mmlab.ie.cuhk.edu.hk/projects/PETA.html

2 Images in the PETA dataset are all disjoint from those in APiS [21].

Fig. 2. The composition of the PETA dataset.

Comparisons: We compare the performance of intersection kernel SVM
(ikSVM) [8], MRF with Gaussian kernel (MRFg), and MRF with random
forest (MRFr), as discussed in Sec. II-B. For the attributes with
imbalanced positive and negative samples, we trained ikSVM for each
attribute by augmenting the positive training examples to the same
size as the negative examples with small variations in scale and
orientation. This is to avoid bias due to imbalanced data
distribution. For MRFg and MRFr, we built the graphs using two
different schemes. The first scheme, denoted by MRFg1 and MRFr1,
is to construct the graphs with only the testing images. The second
one, denoted by MRFg2 and MRFr2, is to include both training and
testing samples in the graphs.
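
The balancing step can be sketched as a sampling plan: for every missing positive, pick an existing positive at random and draw a small scale and rotation perturbation. The perturbation ranges (`max_scale`, `max_angle_deg`) and the function name `augmentation_plan` are hypothetical, since the text only states that the variations are small; applying the transform to the image is left out.

```python
import random


def augmentation_plan(n_pos, n_neg, max_scale=0.05, max_angle_deg=5.0, seed=0):
    """Return one (source_index, scale, angle_deg) triple per synthetic
    positive so that positives match negatives in number."""
    if n_pos == 0:
        return []  # nothing to augment from
    rng = random.Random(seed)
    plan = []
    for _ in range(max(n_neg - n_pos, 0)):
        source = rng.randrange(n_pos)                        # positive to copy
        scale = 1.0 + rng.uniform(-max_scale, max_scale)     # slight rescale
        angle = rng.uniform(-max_angle_deg, max_angle_deg)   # slight rotation
        plan.append((source, scale, angle))
    return plan


# Example: 300 positives against 9,200 negatives in a training set.
plan = augmentation_plan(n_pos=300, n_neg=9200)
print(len(plan))  # 8900 synthetic positives are needed
```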

B. Results 

Evaluating the informativeness of the parsed regions: To
investigate the usefulness of foreground and background regions for
attribute recognition, we first follow the previous study [7], [8]
that applies intersection kernel SVM (ikSVM) [14]. Given the
foreground and background regions extracted by DDN, we evaluate
different representation schemes as discussed in Sec. II-A, i.e.
$u^{whole}$ alone, $u^{fore}$ alone, foreground and background
feature concatenation ($u^{fore}$, $u^{back}$), and foreground and
whole image feature concatenation ($u^{fore}$, $u^{whole}$).

As shown in Table I, we observe that simply extracting the
foreground features ($u^{fore}$) yields inferior performance
compared with using the whole image. This suggests that background
information is critical in facilitating the detection of
attributes. If we inspect the recognition results of each attribute
in detail, we observe that background plays a pivotal role for
recognizing the ‘Backpack’, ‘CarryingOther’, ‘Plastic bag’, and
‘No carrying’ attributes. This is reasonable since the visual
evidence that corresponds to these attributes is not solely
captured by the pedestrian foreground region. Moreover, slight
drops of accuracy are observed on clothing-style related
attributes, e.g. ‘Jeans’ and ‘Trousers’, if features are only
extracted from the foreground. These results all suggest that the
background region could provide context for better attribute
recognition performance.

TABLE I
Comparison of recognition accuracy (%) between different feature
extraction schemes. ikSVM is used as the classifier.

Attribute       $u^{whole}$  $u^{fore}$  ($u^{fore}$, $u^{back}$)  ($u^{fore}$, $u^{whole}$)
Age16-30        80.4         78.6        78.9                      83.1
Age31-45        73.6         71.9        71.8                      77.6
Age46-60        73.1         72.6        72.3                      79.1
AgeAbove60      87.2         89.5        89.1                      93.5
Backpack        66.7         64.4        65.6                      70.7
CarryingOther   64.6         59.4        59.7                      66.9
Casual lower    70.7         69.4        70.1                      76.5
Casual upper    70.3         --          71.0                      --
Formal lower    71.0         69.1        70.4                      76.6
Formal upper    70.0         69.2        70.3                      76.8
Hat             82.3         81.6        83.2                      89.4
Jacket          67.7         63.9        64.7                      69.6
Jeans           74.9         74.1        75.8                      79.8
Leather Shoes   78.9         76.9        77.9                      84.0
Logo            51.1         50.0        50.3                      53.4
Long hair       71.5         73.6        74.5                      79.4
Male            79.7         80.3        80.0                      84.6
MessengerBag    71.8         68.9        68.8                      74.8
Muffler         88.0         85.9        86.9                      92.2
No accessory    76.8         74.1        74.0                      79.2
No carrying     70.4         66.7        68.0                      72.5
Plaid           64.0         60.6        59.3                      65.1
Plastic bag     74.9         71.9        73.2                      79.0
Sandals         50.6         50.6        51.3                      51.9
Shoes           --           --          --                        --
Shorts          56.0         60.5        61.6                      65.2
ShortSleeve     71.3         68.4        69.9                      75.1
Skirt           64.0         65.9        64.7                      69.6
Sneaker         67.5         64.9        65.5                      71.5
Stripes         51.5         50.4        50.0                      51.9
Sunglasses      53.2         51.3        52.6                      53.3
Trousers        --           --          --                        --
Tshirt          64.3         61.0        62.8                      71.1
UpperOther      --           --          --                        --
V-Neck          51.1         51.7        53.3                      53.3
AVERAGE         69.5         68.2        68.8                      73.6

-- : value illegible in the source. The 'Full Distribution' column
(one bar per attribute; blue: positive) is not reproduced.

Extracting and concatenating features from the foreground and
background (($u^{fore}$, $u^{back}$)) yields a slight improvement
over $u^{whole}$ alone for easy-to-spot attributes such as
‘AgeAbove60’, ‘Casual upper wear’, ‘Formal upper wear’, ‘Hat’,
‘Jeans’, ‘Long hair’, ‘Male’, ‘Shorts’, ‘Skirt’; however, the
performance deteriorates for other attributes due to the inevitable
noise contained in the features extracted from the background.
Finally, when ($u^{fore}$, $u^{whole}$) is adopted, a significant
boost in the performance is observed, even for hard-to-spot
attributes like ‘Leather Shoes’ and ‘Plastic bag’. The
($u^{fore}$, $u^{whole}$) scheme seems to be a better way to exploit
the information provided by the background.

Fig. 3. Examples of attribute recognition with forest-based MRF
(MRFr2). Panels: AgeAbove60 (93.8%), Hat (90.4%), Male (86.5%),
Long hair (80.1%), Backpack (71.0%), Sunglasses (53.5%).

Evaluating the importance of neighborhood context: We choose the
best three of the four feature extraction schemes, namely
$u^{whole}$, ($u^{fore}$, $u^{back}$), and ($u^{fore}$, $u^{whole}$),
and evaluate our proposed MRF methodology for detecting pedestrian
attributes. We report the attribute detection accuracy in Table II
and list some further observations as follows.

Firstly, the MRF-based methods outperform ikSVM on most of the
attributes (comparing Table II with Table I). For instance, MRFr2
achieves an average improvement of 3.4% over ikSVM for the ‘age’
attributes shown at the top of the tables. This is significant in a
dataset with large appearance diversity and ambiguity, and it
demonstrates that graph regularization can improve attribute
inference. In addition, a boost of about 5% is observed for
attributes such as ‘MessengerBag’, ‘No accessory’, ‘No carrying’,
and ‘Trousers’, and we observe a boost of nearly 10% over ikSVM for
‘CarryingOther’ and ‘Shoes’. Secondly, the MRF graphs built with
the second scheme (graph constructed from both training and testing
samples) are superior to those built with the first scheme (graph
constructed from testing samples only), which is reasonable as
using both the training and testing data can better cover the image
space. Thirdly, for many important attributes, such as ‘Trousers’
and ‘Shoes’, random forest works much better than the Gaussian
kernel for measuring the neighborhood context.
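
The 3.4% figure for the ‘age’ attributes can be checked against the tables. The short calculation below assumes that the comparison is made on the whole-image features, i.e. the first column of Table I against the first MRFr2 value in Table II; this pairing is inferred from the numbers and is not stated in the text.

```python
# Accuracy (%) with whole-image features, copied from Tables I and II.
iksvm_whole = {"Age16-30": 80.4, "Age31-45": 73.6,
               "Age46-60": 73.1, "AgeAbove60": 87.2}
mrfr2_whole = {"Age16-30": 83.8, "Age31-45": 78.8,
               "Age46-60": 76.4, "AgeAbove60": 89.0}

# Per-attribute gain of MRFr2 over ikSVM, then the mean gain.
gains = {name: mrfr2_whole[name] - iksvm_whole[name] for name in iksvm_whole}
mean_gain = sum(gains.values()) / len(gains)

print(round(mean_gain, 1))  # 3.4
```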

Moreover, we observed that for our proposed MRF methods, the
importance of background information as context is best exploited
when using ($u^{fore}$, $u^{whole}$) (Table II). This observation
agrees with the detection performance using ikSVMs (Table I), and
the best result is obtained when we use MRFr2 with
($u^{fore}$, $u^{whole}$), which on average outperforms the
$u^{whole}$ scheme in our earlier preliminary result [5] by 4.4%.
Fig. 3 shows some attribute recognition results using the forest
MRF. The detection performance is satisfactory for most attributes.
False negative samples typically result from occlusion (e.g.
backpack), color ambiguity (long hair) and background noise (male).
All methods perform poorly on attributes with imbalanced
positive-negative distribution (see Table I) such as ‘logo’,
‘stripes’, ‘v-neck’ and ‘sunglasses’, which are also hard for human
observers to spot.

TABLE II
Recognition accuracy (%) using Markov Random Field approaches.

Attribute       MRFg1             MRFg2             MRFr1             MRFr2
Age16-30        80.9, 78.9, 83.2  81.7, 78.9, 83.2  80.9, 81.3, 84.8  83.8, 83.1, 86.8
Age31-45        74.6, 72.3, 78.0  76.2, 72.3, 78.0  74.0, 72.2, 80.4  78.8, 76.4, 83.1
Age46-60        74.1, 72.6, 79.3  75.2, 72.6, 79.3  73.2, 72.6, 78.8  76.4, 75.5, 80.1
AgeAbove60      87.2, 89.2, 93.4  88.2, 89.2, 93.4  86.3, 88.7, 90.6  89.0, 88.9, 93.8
Backpack        67.1, 65.9, 71.0  67.1, 65.9, 71.0  67.0, 66.0, 70.7  67.2, 66.1, 70.5
CarryingOther   64.9, 60.2, 67.3  66.8, 60.2, 67.3  64.6, 60.3, 67.3  68.0, 67.0, 73.0
Casual lower    70.9, 69.8, 76.0  71.6, 69.8, 76.1  70.4, 69.9, 76.2  71.3, 70.9, 78.2
Casual upper    70.4, 70.7, 75.4  71.3, 70.7, 75.4  69.8, 70.3, 75.9  71.3, 71.5, 78.1
Formal lower    71.2, 70.9, 76.9  71.8, 70.9, 77.0  71.2, 70.5, 76.9  71.9, 70.5, 79.0
Formal upper    70.3, 70.8, 77.0  70.4, 70.8, 77.1  70.3, 70.8, 77.1  70.0, 72.0, 78.7
Hat             82.9, 83.2, 89.5  84.3, 83.2, 89.5  82.3, 82.4, 88.8  86.7, 84.5, 90.4
Jacket          68.3, 65.0, 69.8  68.4, 65.0, 69.8  68.1, 65.0, 69.8  67.9, 66.8, 72.2
Jeans           75.2, 76.3, 80.2  76.1, 76.3, 80.2  75.0, 76.0, 79.8  76.0, 75.9, 81.0
Leather Shoes   80.1, 78.0, 84.4  80.9, 78.0, 84.4  79.1, 78.4, 84.5  81.7, 82.5, 87.2
Logo            51.1, 50.5, 53.8  51.1, 50.5, 53.8  51.1, 50.0, 53.8  50.7, 50.5, 52.7
Long hair       71.7, 75.2, 79.5  72.6, 75.2, 79.5  71.8, 75.1, 79.6  72.8, 75.2, 80.1
Male            80.3, 79.9, 84.5  80.9, 79.9, 84.5  80.6, 81.3, 85.9  81.4, 81.9, 86.5
MessengerBag    72.9, 69.0, 75.1  74.3, 69.0, 75.1  72.7, 69.0, 74.6  75.5, 73.8, 78.3
Muffler         88.3, 86.9, 92.2  89.5, 86.9, 92.2  86.5, 86.6, 92.3  91.3, 87.9, 93.7
No accessory    77.2, 73.8, 78.9  78.6, 73.8, 78.9  77.1, 74.8, 79.6  80.0, 78.5, 82.7
No carrying     70.6, 68.5, 73.1  71.6, 68.5, 73.1  70.6, 68.5, 73.1  71.5, 69.6, 76.5
Plaid           64.5, 59.6, 65.1  64.5, 59.6, 65.1  65.0, 59.6, 65.1  65.0, 59.6, 65.2
Plastic bag     74.9, 73.6, 79.0  75.5, 73.6, 79.0  73.9, 73.6, 79.2  75.5, 74.1, 81.3
Sandals         50.6, 51.2, 51.6  50.6, 51.2, 51.6  50.6, 51.2, 51.9  50.6, 51.3, 52.2
Shoes           71.0, 66.9, 72.4  72.5, 66.9, 72.4  70.8, 66.9, 72.8  73.6, 73.1, 78.4
Shorts          56.5, 61.8, 65.7  56.5, 61.8, 65.7  56.5, 61.2, 65.7  56.5, 61.8, 65.2
ShortSleeve     71.7, 70.5, 75.4  71.8, 70.5, 75.4  71.8, 70.6, 74.0  71.6, 70.5, 75.8
Skirt           64.0, 65.3, 69.6  64.0, 65.3, 69.6  64.0, 65.0, 69.6  64.3, 65.2, 69.6
Sneaker         68.1, 66.2, 72.0  69.0, 66.2, 72.0  68.2, 66.2, 71.7  69.3, 66.4, 75.0
Stripes         52.3, 50.0, 51.9  52.3, 50.0, 51.9  52.3, 50.0, 51.9  52.3, 50.0, 51.9
Sunglasses      53.2, 52.6, 53.3  53.2, 52.6, 53.3  53.9, 52.6, 53.5  53.9, 52.6, 53.5
Trousers        74.5, 72.9, 77.9  75.7, 72.9, 77.9  75.7, 76.5, 80.9  76.5, 77.0, 82.2
Tshirt          64.5, 63.6, 71.5  64.6, 63.6, 71.5  63.6, 63.6, 71.5  64.2, 63.6, 71.4
UpperOther      80.7, 79.3, 83.2  81.8, 79.3, 83.2  81.1, 81.4, 84.3  83.9, 83.3, 87.3
V-Neck          51.1, 53.3, 53.3  51.1, 53.3, 53.3  51.1, 53.3, 53.3  51.1, 53.3, 53.3
AVERAGE         69.9, 69.0, 73.7  70.6, 69.0, 73.7  69.7, 69.2, 73.9  71.2, 70.6, 75.6

There are three values for each compared method. They correspond to
the three feature extraction schemes, i.e. $u^{whole}$,
($u^{fore}$, $u^{back}$), and ($u^{fore}$, $u^{whole}$), respectively.

IV. Conclusions

In this work, we investigated a novel approach that exploits the
neighborhood information among image samples with emphasis on the
foreground attribute regions, and showed that the automatically
inferred pairwise graph topology leads to better attribute
recognition performance. Using the latest large-scale pedestrian
attribute dataset (PETA) as the benchmark, we showed that our new
MRF model with the proposed feature representation scheme detects
pedestrian attributes more accurately.